{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 14 - Hypothesis testing with multiple categories\n", "\n", "In Labs 12 and 13, we looked at categorical data with two categories (Smoker/Non-smoker, Black/Non-Black, Purple/White). In this lab we will learn how to test hypotheses when our categorical data has more than two categories.\n", "\n", "First, let's import the necessary libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Jury Panels in Alameda County 2009-2010\n", "\n", "We will look at jury panel data from 2009 and 2010 collected by the American Civil Liberties Union (ACLU). The total number of people who reported for jury duty in those years was 1,453. See [11.2 Multiple Categories](https://www.inferentialthinking.com/chapters/11/2/Multiple_Categories.html) for more information.\n", "\n", "We can create a dataframe with this data as shown below. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create a dictionary listing each column, followed by the column values in a list\n", "jury_data = {\"Eligible\":[0.15, 0.18, 0.12, 0.54, 0.01],\n", " \"Panels\":[0.26, 0.08, 0.08, 0.54, 0.04]}\n", "# pass the dictionary into the dataframe creation function as a parameter\n", "# also pass in labels for the rows\n", "jury = pd.DataFrame(data = jury_data, index = [\"Asian\",\"Black\",\"Latinx\",\"White\",\"Other\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think are the columns and rows of the new dataframe? Check your answer by displaying the dataframe `jury`. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now make a bar chart of the two columns of data. Because our dataframe `jury` already contains the proportion of each category, we do not have to use the `value_counts()` function. Instead, type `jury.plot(kind = \"bar\")` below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "How does the distribution of eligible jurors compare with the distribution of the jury panels?\n", "\n", "To know whether the variation between the distributions is just the result of chance, we can draw a random sample from the eligible distribution and compare it with the panel distribution.\n", "\n", "As in Labs 12 and 13, create variables for the eligible population and its distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " population = [\"Asian\",\"Black\",\"Latinx\",\"White\",\"Other\"]\n", "pop_prob = [0.15, 0.18, 0.12, 0.54, 0.01]\n", "
\n", "\n", "Create a sample from this population. The sample should be the same size as our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " sample = np.random.choice(population,p = pop_prob,size = 1453)\n", "
\n", "\n", "Compute the value counts for the sample. Since we only have the probabilities of the eligible population, we want to compute the value counts as probabilities as well. We can do this by adding the parameter `normalize = True`. Save the probabilities of the sample in the variable `sample_probs`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " sample_probs = pd.Series(sample).value_counts(normalize = True)\n", "\n", "
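
One detail worth checking in this step: `value_counts` only lists categories that actually appear in the sample, so a rare category such as Other could in principle be missing and would then show up as NaN once the column is added to `jury`. A minimal sketch of one way to guard against that, assuming the same `population` and `pop_prob` lists (the seeded generator `rng` and the name `sample_probs_full` are illustrative and not part of the lab):

```python
import numpy as np
import pandas as pd

population = ['Asian', 'Black', 'Latinx', 'White', 'Other']
pop_prob = [0.15, 0.18, 0.12, 0.54, 0.01]

# Seeded generator so the sketch is reproducible (an assumption, not required by the lab)
rng = np.random.default_rng(0)
sample = rng.choice(population, p=pop_prob, size=1453)

# Proportion of each category that appears in the sample
sample_probs = pd.Series(sample).value_counts(normalize=True)

# Reorder to match the population list; a category missing from the sample gets 0
sample_probs_full = sample_probs.reindex(population, fill_value=0)
```

With a sample this large the reindexing rarely changes anything, but it keeps the row order identical to `jury` and avoids missing values.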
\n", "\n", "Next we will create a new column in our dataframe called `Random` that contains the probabilities from our random sample. To do this, type `jury[\"Random\"] = sample_probs` below and run it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the `jury` dataframe again to check that the column was added." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the bar chart of the dataframe again, and the new column will be included." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " jury.plot(kind = \"bar\")\n", "
\n", "\n", "How does the distribution of the random sample compare to the eligible distribution? To the panel distribution?\n", "\n", "Let's compare the panels distribution to the eligible distribution quantitatively using hypothesis testing. We need to choose a statistic to simulate, called the *test statistic*. In this problem, we will use the *total variation distance (TVD)* as the test statistic. The TVD measures how far apart two distributions over the same categories are.\n", "\n", "First we will compute the TVD between the panels and eligible distributions, starting with the difference in each category:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "jury[\"Difference\"] = jury[\"Panels\"] - jury[\"Eligible\"]\n", "jury" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the sum of the difference column? \n", "\n", "Both columns sum to 1, so the positive and negative differences cancel and the sum tells us nothing about how different the distributions are. To fix this, we will take the absolute differences between probabilities." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "jury[\"Absolute Difference\"] = np.abs(jury[\"Difference\"])\n", "jury" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does `np.abs` do to the values in the column?\n", "\n", "Now take the sum of the absolute difference column. You can use the same command as when we took the sum of a filter." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "jury[\"Absolute Difference\"].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " jury[\"Absolute Difference\"].sum()\n", "\n", "
\n", "\n", "Notice that this sum is twice the sum of the positive differences (equivalently, twice the magnitude of the sum of the negative differences), so we divide it by two. This quantity is the *total variation distance (TVD)* between the distribution of ethnicity in the eligible juror population and the panel." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data_tvd = jury[\"Absolute Difference\"].sum()/2\n", "data_tvd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could have done this calculation in one line of code:\n", "`np.abs(jury[\"Panels\"] - jury[\"Eligible\"]).sum()/2`\n", "Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we want to understand the distribution of the test statistic (here, the TVD) if the panels were actually drawn from the eligible distribution. To do this, we will simulate random samples from the eligible distribution and compute the total variation distance between each sample and the eligible distribution.\n", "\n", "First, draw one random sample and compute the TVD between its probabilities and the eligible distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " sample = np.random.choice(population,p = pop_prob,size = 1453)\n", "sample_probs = pd.Series(sample).value_counts(normalize = True)\n", "jury[\"Random\"] = sample_probs\n", "sample_tvd = np.abs(jury[\"Random\"] - jury[\"Eligible\"]).sum()/2\n", "\n", "
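
The one-line TVD calculation can also be wrapped in a small helper so it can be reused for any pair of distributions. This is a sketch that assumes both inputs list the categories in the same order; the function name `total_variation_distance` is illustrative and not part of the lab.

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    # Half the sum of absolute differences between two distributions
    # whose entries are listed in the same category order
    return np.abs(np.asarray(dist1) - np.asarray(dist2)).sum() / 2

eligible = [0.15, 0.18, 0.12, 0.54, 0.01]
panels = [0.26, 0.08, 0.08, 0.54, 0.04]

# TVD between the panels and the eligible population (about 0.14)
panel_tvd = total_variation_distance(panels, eligible)
```

Passing pandas columns such as the `Random` and `Eligible` columns of `jury` works the same way, because `np.asarray` converts them to arrays in row order.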
\n", "\n", "Now we want to repeat this process many times and make a histogram of the resulting TVD values. First, use a loop to generate many samples and compute the TVD between each sample and the eligible distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " tvds = []\n", "for i in range(10000):\n", " sample = np.random.choice(population,p = pop_prob,size = 1453)\n", " sample_probs = pd.Series(sample).value_counts(normalize = True)\n", " jury[\"Random\"] = sample_probs\n", " sample_tvd = np.abs(jury[\"Random\"] - jury[\"Eligible\"]).sum()/2\n", " tvds.append(sample_tvd)\n", "\n", "
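
The loop is straightforward but can take a while, since it builds a pandas Series on every pass. As a hedged alternative, the same kind of simulation can be drawn in one call with NumPy's multinomial sampler, which returns category counts directly. This is a sketch, not the lab's intended solution; `rng`, `counts`, `sim_probs`, and `sim_tvds` are illustrative names.

```python
import numpy as np

pop_prob = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
n_people = 1453      # sample size used in the answers
n_repeats = 10000    # number of simulated samples

# Seeded generator so the sketch is reproducible (an assumption, not required by the lab)
rng = np.random.default_rng(0)

# Each row holds the category counts for one simulated sample of n_people
counts = rng.multinomial(n_people, pop_prob, size=n_repeats)

# Convert counts to proportions, then compute one TVD per row
sim_probs = counts / n_people
sim_tvds = np.abs(sim_probs - pop_prob).sum(axis=1) / 2
```

The array `sim_tvds` plays the same role as the list `tvds` and can be passed to `pd.Series(...).hist()` in the same way.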
\n", "\n", "Next make the histogram of these simulated test statistics (the TVDs)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " pd.Series(tvds).hist()\n", "\n", "
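
To put a number on the visual comparison, one common approach is an empirical p-value: the fraction of simulated TVDs that are at least as large as the TVD observed in the data. The sketch below is self-contained, so it regenerates simulated TVDs with NumPy's multinomial sampler rather than reusing `tvds`; the names `observed_tvd` and `p_value` are illustrative and not part of the lab.

```python
import numpy as np

eligible = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
panels = np.array([0.26, 0.08, 0.08, 0.54, 0.04])

# Observed test statistic: TVD between the panels and the eligible population
observed_tvd = np.abs(panels - eligible).sum() / 2

# Simulate TVDs under the null hypothesis (samples drawn from the eligible distribution)
rng = np.random.default_rng(0)
counts = rng.multinomial(1453, eligible, size=10000)
sim_tvds = np.abs(counts / 1453 - eligible).sum(axis=1) / 2

# Empirical p-value: share of simulated TVDs at least as large as the observed one
p_value = np.mean(sim_tvds >= observed_tvd)
```

A small `p_value` means that samples drawn from the eligible distribution rarely produce a TVD as large as the one seen in the panels.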
\n", "\n", "Does the test statistic computed from the data look like it comes from this distribution?\n", "\n", "### The borough distribution of 311 calls revisited\n", "\n", "In Lab 11, we qualitatively compared the distribution of the boroughs for Sunday March 3 and Monday March 4 for 311 complaints and service requests. We will now use hypothesis testing to quantitatively compare these distributions.\n", "\n", "Null hypothesis: The distribution of boroughs for which 311 calls are made is the same on Sunday and Monday.\n", "\n", "Alternative hypothesis: The distribution of boroughs for which 311 calls are made is different on Sunday than on Monday.\n", "\n", "Load your CSV file with call data from March 3 and 4, 2019 into the dataframe `calls`. Read the `Created Date` column in as a date/time." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the `calls` dataframe to make sure it was loaded into memory correctly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code from Lab 11 to create the overlapping bar charts is below." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.6" } }, "nbformat": 4, "nbformat_minor": 2 }